Mapping the spatio-temporal dynamics of vision in the human brain
Recognition of objects and scenes is a fundamental function of the human brain, necessitating a complex neural machinery that transforms low-level visual information into semantic content. Despite significant advances in characterizing the locus and function of key visual areas, integrating the temporal and spatial dynamics of this processing stream has posed a decades-long challenge to human neuroscience. In this talk I will describe a brain mapping approach that combines magnetoencephalography (MEG), functional MRI (fMRI) measurements, and convolutional neural networks (CNNs) via representational similarity analysis to yield a spatially and temporally integrated characterization of neuronal representations when observers perceive visual events. The approach is well suited to characterize the duration and sequencing of perceptual and cognitive tasks, and to place new constraints on the computational architecture of cognition. In collaboration with: D. Pantazis, R.M. Cichy, A. Torralba, S.M. Khaligh-Razavi, C. Mullin, Y. Mohsenzadeh, B. Zhou, A. Khosl
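The linking step in this approach, representational similarity analysis (RSA), can be illustrated with a minimal sketch: build a representational dissimilarity matrix (RDM) for each modality over the same stimulus conditions, then correlate the RDMs. All array sizes and the random data below are illustrative assumptions, not the actual pipeline described in the talk.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(patterns):
    """Representational dissimilarity matrix: one row per stimulus
    condition; returned as a condensed vector of 1 - Pearson correlation
    between condition patterns."""
    return pdist(patterns, metric="correlation")

rng = np.random.default_rng(0)
n_conditions = 20  # hypothetical number of image conditions

# Hypothetical measurements for the SAME 20 conditions in two modalities:
meg_t = rng.standard_normal((n_conditions, 306))     # MEG sensor pattern at one time point
fmri_roi = rng.standard_normal((n_conditions, 500))  # fMRI voxel pattern in one ROI

# Correlating the two RDMs asks whether the representational geometries
# agree, which is what ties a millisecond (MEG) to a location (fMRI):
rho, _ = spearmanr(rdm(meg_t), rdm(fmri_roi))
```

Repeating this for every MEG time point against every fMRI region (or CNN layer) yields the spatio-temporal correspondence map the abstract describes.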
Interpreting Deep Visual Representations via Network Dissection
The success of recent deep convolutional neural networks (CNNs) depends on
learning hidden representations that can summarize the important factors of
variation behind the data. However, CNNs are often criticized as black boxes
that lack interpretability, since they have millions of unexplained model
parameters. In this work, we describe Network Dissection, a method that
interprets networks by providing labels for the units of their deep visual
representations. The proposed method quantifies the interpretability of CNN
representations by evaluating the alignment between individual hidden units and
a set of visual semantic concepts. By identifying the best alignments, units
are given human interpretable labels across a range of objects, parts, scenes,
textures, materials, and colors. The method reveals that deep representations
are more transparent and interpretable than expected: we find that
representations are significantly more interpretable than they would be under a
random equivalently powerful basis. We apply the method to interpret and
compare the latent representations of various network architectures trained to
solve different supervised and self-supervised training tasks. We then examine
factors affecting network interpretability, such as the number of
training iterations, regularization, initialization, and
network depth and width. Finally, we show that the interpreted units can be used
to provide explicit explanations of a prediction given by a CNN for an image.
Our results highlight that interpretability is an important property of deep
neural networks that provides new insights into their hierarchical structure.
Comment: *B. Zhou and D. Bau contributed equally to this work. 15 pages, 27 figures.
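The alignment score at the heart of Network Dissection is an intersection-over-union (IoU) between a unit's thresholded activation masks and a concept's segmentation masks, computed over a probe dataset. A minimal sketch with synthetic data follows; the array shapes, concept names, and the high activation quantile are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def iou(unit_masks, concept_masks):
    """Intersection-over-union between a unit's thresholded activation
    masks and a concept's binary segmentation masks (same shape)."""
    inter = np.logical_and(unit_masks, concept_masks).sum()
    union = np.logical_or(unit_masks, concept_masks).sum()
    return inter / union if union else 0.0

def dissect_unit(activations, concept_masks, quantile=0.995):
    """Label one unit: threshold its activation maps at a high quantile
    (so only the strongest responses count), then return the concept
    whose segmentation masks overlap best under IoU."""
    thresh = np.quantile(activations, quantile)
    unit_masks = activations > thresh
    scores = {name: iou(unit_masks, masks)
              for name, masks in concept_masks.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(1)
acts = rng.random((10, 7, 7))  # one unit's activation maps over 10 probe images
concepts = {                   # hypothetical concept segmentations
    "grass": rng.random((10, 7, 7)) > 0.5,
    "sky":   rng.random((10, 7, 7)) > 0.9,
}
label, scores = dissect_unit(acts, concepts)
```

Running this over every unit and a broad concept vocabulary (objects, parts, scenes, textures, materials, colors) gives each unit its human-interpretable label.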
Global Depth Perception from Familiar Scene Structure
In the absence of cues for absolute depth measurement, such as binocular disparity, motion, or defocus, the absolute distance between the observer and a scene cannot be measured. The interpretation of shading, edges, and junctions may provide a 3D model of the scene, but it will not convey the actual "size" of the space. One possible source of information for absolute depth estimation is the image size of known objects. However, this is computationally complex due to the difficulty of the object recognition process. Here we propose a source of information for absolute depth estimation that does not rely on specific objects: we introduce a procedure for absolute depth estimation based on the recognition of the whole scene. The shape of the space of the scene and the structures present in the scene are strongly related to the scale of observation. We demonstrate that, by recognizing the properties of the structures present in the image, we can infer the scale of the scene, and therefore its absolute mean depth. We illustrate the usefulness of computing the mean depth of the scene with applications to scene recognition and object detection.
Recognition of natural scenes from global properties: Seeing the forest without representing the trees
Human observers are able to rapidly and accurately categorize natural scenes, but the representation mediating this feat is still unknown. Here we propose a framework of rapid scene categorization that does not segment a scene into objects and instead uses a vocabulary of global, ecological properties that describe spatial and functional aspects of scene space (such as navigability or mean depth). In Experiment 1, we obtained ground truth rankings on global properties for use in Experiments 2–4. To what extent do human observers use global property information when rapidly categorizing natural scenes? In Experiment 2, we found that global property resemblance was a strong predictor of both false alarm rates and reaction times in a rapid scene categorization experiment. To what extent is global property information alone a sufficient predictor of rapid natural scene categorization? In Experiment 3, we found that the performance of a classifier representing only these properties is indistinguishable from human performance in a rapid scene categorization task in terms of both accuracy and false alarms. To what extent is this high predictability unique to a global property representation? In Experiment 4, we compared two models that represent scene object information to human categorization performance and found that these models had lower fidelity at representing the patterns of performance than the global property model. These results provide support for the hypothesis that rapid categorization of natural scenes may not be mediated primarily through objects and parts, but also through global properties of structure and affordance.
National Science Foundation (U.S.) (Graduate Research Fellowship); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Career Award 0546262); NEC Corporation Fund for Research in Computers and Communication
The Briefest of Glances: The Time Course of Natural Scene Understanding
What information is available from a brief glance at a novel scene? Although previous efforts to answer this question have focused on scene categorization or object detection, real-world scenes contain a wealth of information whose perceptual availability has yet to be explored. We compared image exposure thresholds in several tasks involving basic-level categorization or global-property classification. All thresholds were remarkably short: Observers achieved 75%-correct performance with presentations ranging from 19 to 67 ms, reaching maximum performance at about 100 ms. Global-property categorization was performed with significantly less presentation time than basic-level categorization, which suggests that there exists a time during early visual processing when a scene may be classified as, for example, a large space or navigable, but not yet as a mountain or lake. Comparing the relative availability of visual information reveals bottlenecks in the accumulation of meaning. Understanding these bottlenecks provides critical insight into the computations underlying rapid visual understanding.
National Science Foundation (U.S.) (CAREER Award 0546262); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Graduate Research Fellowship)
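An exposure threshold like the 75%-correct values above is typically read off a measured psychometric function. A minimal sketch, interpolating accuracy against presentation duration; the durations and accuracy values here are made-up illustrations, not the study's data:

```python
import numpy as np

def threshold_75(durations_ms, accuracy):
    """Estimate the presentation duration yielding 75%-correct
    performance by linear interpolation along a psychometric function
    (accuracy must be monotonically increasing for np.interp)."""
    return float(np.interp(0.75, accuracy, durations_ms))

# Hypothetical psychometric data: accuracy rises with exposure duration.
durations = np.array([6, 12, 24, 48, 96, 192])        # ms
accuracy  = np.array([0.5, 0.55, 0.68, 0.82, 0.95, 0.97])
t75 = threshold_75(durations, accuracy)
```

In practice a sigmoid (e.g. Weibull) fit is preferred over raw interpolation, but the idea is the same: the threshold is the duration at which the fitted curve crosses the target accuracy.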
Canonical views of scenes depend on the shape of the space
When recognizing or depicting objects, people show a preference for particular “canonical” views. Are there
similar preferences for particular views of scenes? We investigated this question using panoramic images, which
show a 360-degree view of a location. Observers used an interactive viewer to explore the scene and select the best view. We found that agreement between observers on the “best” view of each scene was generally high. We attempted to predict the selected views using a model based on the shape of the space around the camera location and on the navigational constraints of the scene. The model performance suggests that observers select views which capture as much of the surrounding space as possible, but do not consider navigational constraints when selecting views. These results seem analogous to findings with objects, which suggest that canonical views maximize the visible surfaces of an object, but are not necessarily functional views.
National Science Foundation (U.S.) (CAREER Award 0546262); National Science Foundation (U.S.) (Grant 0705677); National Institutes of Health (U.S.) (Grant 1016862); National Eye Institute (Grant EY02484); National Science Foundation (U.S.) (Graduate Research Fellowship)
Artifact magnification on deepfake videos increases human detection and subjective confidence
The development of technologies for easily and automatically falsifying video
has raised practical questions about people's ability to detect false
information online. How vulnerable are people to deepfake videos? What
technologies can be applied to boost their performance? Human susceptibility to
deepfake videos is typically measured in laboratory settings, which do not
reflect the challenges of real-world browsing. In typical browsing, deepfakes
are rare, engagement with the video may be short, participants may be
distracted, or the video streaming quality may be degraded. Here, we tested
deepfake detection under these ecological viewing conditions, and found that
detection was lowered in all cases. Principles from signal detection theory
indicated that different viewing conditions affected different dimensions of
detection performance. Overall, this suggests that the current literature
underestimates people's susceptibility to deepfakes. Next, we examined how
computer vision models might be integrated into users' decision process to
increase accuracy and confidence during deepfake detection. We evaluated the
effectiveness of communicating the model's prediction to the user by amplifying
artifacts in fake videos. We found that artifact amplification was highly
effective at making fake video distinguishable from real, in a manner that was
robust across viewing conditions. Additionally, compared to a traditional
text-based prompt, artifact amplification was more convincing: people accepted
the model's suggestion more often, and reported higher final confidence in
their model-supported decision, particularly for more challenging videos.
Overall, this suggests that visual indicators that cause distortions on fake
videos may be highly effective at mitigating the impact of falsified video.
Comment: 8 pages, 4 figures.
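The signal detection theory analysis mentioned above separates two dimensions of detection performance: sensitivity (d', how well real and fake are discriminated) and criterion (c, the bias toward answering "real" or "fake"). Both follow directly from hit and false-alarm rates; the rates below are illustrative assumptions, not the study's results.

```python
from statistics import NormalDist

def sdt_measures(hit_rate, fa_rate):
    """Sensitivity (d') and criterion (c) from a hit rate (fakes
    correctly called fake) and a false-alarm rate (real videos called
    fake), via the inverse normal CDF (z-transform)."""
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion

# Hypothetical condition: symmetric hit/false-alarm rates give an
# unbiased criterion, so any deficit shows up purely in sensitivity.
d, c = sdt_measures(0.7, 0.3)
```

Comparing how d' and c shift across viewing conditions (rarity, distraction, degraded streaming) is what lets the analysis say *which* dimension of detection a condition degrades.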
Quadri-stability of a spatially ambiguous auditory illusion
In addition to vision, audition plays an important role in sound localization in our world. One way we estimate the motion of an auditory object moving towards or away from us is from changes in volume intensity. However, the human auditory system has unequally distributed spatial resolution, including difficulty distinguishing sounds in front vs. behind the listener. Here, we introduce a novel quadri-stable illusion, the Transverse-and-Bounce Auditory Illusion, which combines front-back confusion with changes in volume levels of a nonspatial sound to create ambiguous percepts of an object approaching and withdrawing from the listener. The sound can be perceived as traveling transversely from front to back or back to front, or “bouncing” to remain exclusively in front of or behind the observer. Here we demonstrate how human listeners experience this illusory phenomenon by comparing ambiguous and unambiguous stimuli for each of the four possible motion percepts. When asked to rate their confidence in perceiving each sound’s motion, participants reported equal confidence for the illusory and unambiguous stimuli. Participants perceived all four illusory motion percepts, and could not distinguish the illusion from the unambiguous stimuli. These results show that this illusion is effectively quadri-stable. In a second experiment, the illusory stimulus was looped continuously in headphones while participants identified its perceived path of motion to test properties of perceptual switching, locking, and biases. Participants were biased towards perceiving transverse compared to bouncing paths, and they became perceptually locked into alternating between front-to-back and back-to-front percepts, perhaps reflecting how auditory objects commonly move in the real world. 
This multi-stable auditory illusion opens opportunities for studying the perceptual, cognitive, and neural representation of objects in motion, as well as exploring multimodal perceptual awareness.
United States. Dept. of Defense (National Defense Science and Engineering Graduate (NDSEG) Fellowship)